Abstract. We extend a hypergraph representation, introduced by Finkelstein and Roytberg, to unify
dynamic programming algorithms in the context of RNA folding with pseudoknots. Classic applications
of RNA dynamic programming (energy minimization, partition function, base-pair probabilities...)
are reformulated within this framework, giving rise to very simple algorithms. This reformulation
allows one to conceptually detach the conformation space/energy model, captured by the hypergraph
model, from the specific application, assuming unambiguity of the decomposition. To ensure the
latter property, we propose a new combinatorial methodology based on generating functions. We
extend the set of generic applications by proposing an exact algorithm for extracting generalized
moments in weighted distributions, generalizing a prior contribution by Miklós et al. Finally,
we illustrate our full-fledged programme on three exemplary conformation spaces (secondary
structures, Akutsu's simple-type pseudoknots and kissing hairpins). This readily gives sets of
algorithms that are either novel or have complexity comparable to classic implementations for
minimization and Boltzmann ensemble applications of dynamic programming.
1 Introduction

Motivation. Over the past decades, biology as a field has become increasingly aware of the
importance and diversity of roles played by ribonucleic acids (RNA). In addition to playing
house-keeping parts, as initially contemplated by the proteo-centric view of cellular processes,
RNA is now accepted as a major player in gene regulation mechanisms. For instance, silencing
activity (miRNAs, siRNAs) and multi-stable cis-regulatory elements (riboswitches) are currently
the subject of much research. Furthermore, a recent genome-wide experiment has revealed that a
large portion of the human genome is subject to transcription into RNA. While it is unlikely for
all these transcripts to be functional as RNAs, novel classes and roles are currently under
investigation. Most of the functional roles played by RNA require the RNA to adopt a specific
structure to make an interaction possible, hide/exhibit an active site or allow for a catalytic
action (ribozymes). Being able to understand and simulate how RNA folds is therefore a crucial
step toward understanding its function.

Ab initio secondary structure prediction. Initial algorithmic methods for the ab initio prediction
of RNA folding considered a coarse-grain conformation space, the secondary structure, where each
conformation is defined as a non-crossing subset of admissible base-pairs. This led Nussinov and
Jacobson [39] to design an $O(n^3)$ dynamic-programming (DP) algorithm for the base-pair
maximization problem. Building on a nearest-neighbor free-energy model proposed by Tinoco et al.
[51] and extended by the Turner group, Zuker and Stiegler [56] created MFold, an $O(n^3)$ algorithm
for minimizing the free-energy (MFE folding), later shown to correctly predict about 73% of
base-pairs on a benchmark of RNAs of length < 700 nucleotides [34]. An independent implementation
of the algorithm is proposed within the popular ViennaRNA package maintained by Hofacker [22].
Probabilistic alternatives (SFold [11], ContraFold [14] and CentroidFold [20]) have also recently
been proposed with substantial improvement, relying on a dynamic programming scheme similar to that
of MFold to traverse the conformation space in polynomial time, coupled with some postprocessing
steps.

Ensemble approaches. Since the seminal work of McCaskill [35], the concept of Boltzmann equilibrium
has been used to embrace the diversity of foldings accessible to an RNA sequence. He showed that
the partition function of an RNA - a weighted sum over the set of all compatible structures -
could be computed through a simple transposition of the DP scheme used for MFE folding. Coupled
with a variant of the inside/outside algorithm, this led to an exact computation of base-pair
probabilities in the Boltzmann-weighted ensemble. This opened the door for more robust predictions,
e.g. for RNAs whose MFE folding is an outlier. This intuition was later validated by Mathews [33],
who showed that the Boltzmann probability correlated well with the actual presence of base-pairs
in experimentally-determined structures. Ding et al. [11] pushed this paradigm shift a step further
by clustering sets of structures sampled within the Boltzmann distribution and computing a
consensus, improving on the positive predictive value (PPV) of existing algorithms. This ensemble
view naturally spread toward other applications of DP in Bioinformatics (sequence alignment [38],
simultaneous alignment and folding [21], 3D structural alignment [15]), and is increasingly
becoming a part of the algorithmic toolbox of bioinformaticians.

Pseudoknotted conformations. Although substantially successful in their task, secondary structure
prediction algorithms were intrinsically limited by their inability to explore conformations
featuring crossing base-pairs. Such motifs, called pseudoknots, were initially excluded from the
conformation space based on the rationale that their contribution to the free-energy would remain
limited. Furthermore, the addition of all possible pseudoknots was shown to turn MFE folding into
an NP-complete problem, even in a simple nearest-neighbor model [1,30]. However, such conformations
do naturally occur, and can be essential to functional mechanisms such as -1 frameshift recoding
events [?] or the formation of tertiary motifs [40]. Therefore many exact DP approaches
[45,30,13,42,6,1,8,7,23,50,44] have been proposed over the years to extract the MFE structure
within restricted - polynomially solvable - classes of pseudoknots. However, most of these
approaches (with the notable exceptions of [13,6,44]) were based on ambiguous DP schemes, leading
them to consider certain structures multiple times. While such ambiguity would not be worrisome in
the context of energy minimization, it prevents a direct transposition of these algorithms to
ensemble applications (partition function, base-pair probabilities) by heavily biasing - for no
biologically valid reason - derived estimates.

Unambiguous decompositions. This lack of focus on unambiguity in the design of RNA (pseudoknotted)
DP algorithms can be explained by two main reasons. Firstly, certain conformation spaces may not
admit unambiguous schemes. Indeed it has been shown by Condon et al. [9] that many PK
conformational spaces can be modeled as a formal language, while Flajolet [18] had shown, using a
combinatorial argument, that certain simple context-free languages are inherently ambiguous, i.e.
not generated by any unambiguous context-free grammar. A second explanation is more historical: DP
algorithm designers were initially focused on optimization problems, and considered the DP
equation, not the decomposition of the search space, as the central object of their contributions.
Indeed, in the optimization perspective, it is not mandatory for the conformation space to be
completely (e.g. sparsification) or unambiguously (e.g. multiply occurring best structure)
generated. As decompositions grow more and more complex to capture more complex energy models and
topological limitations, these two key properties are becoming increasingly hard to ascertain at
the level of DP equations. Consequently there is a need for a more rational framework to
facilitate the design of conformational spaces.

Combinatorial dynamic programming. Over the last century, enumerative combinatorics as a field has
been focusing on providing elegant decompositions for all sorts of objects. Our proposal is to
adopt a similar discipline in the design of DP decompositions, which is in our opinion the only
task worthy of human attention, and which will eventually lead to an automated procedure for the
actual production of codes/algorithms. To that purpose we chose to build on and revisit a
hypergraph analogy proposed by Finkelstein et al. [16] as a unifying framework for RNA folding and
other applications of DP in Bioinformatics, which we generalize into combinatorial classes
amenable to analysis using generating functions.

Related work. The two main frameworks offering abstract views over dynamic programming are
Lefebvre's multi-tape attributed grammars [26] and Giegerich's Algebraic Dynamic Programming (ADP)
[19], respectively building on multi-tape attributed grammars and context-free grammars. Although
very elegant and mature in their implementations, they suffer from limitations in expressivity that
are intrinsic to their underlying formalisms. For instance, ADP has to resort to an explicit
manipulation of indices in order to achieve competitive complexities for canonical pseudoknots
[42], while Lefebvre's multi-tape grammars [27] require increased complexity to capture
pseudoknots. Another formal description of pseudoknotted search spaces is M. Möhl's split types
[37], which focus on how non-contiguous portions are combined, providing a very compact
description for pseudoknotted conformation spaces. Compared to these abstract representations, the
hypergraph formalism achieves a greater expressivity by: i) implementing an unordered product;
ii) allowing explicit manipulation of indices; iii) allowing additional information to be stored
within nodes (remember that context-free grammars allow for a finite number of non-terminals). For
instance, polynomial hypergraphs could be proposed for counting homogeneous alignments [25],
whereas these objects cannot be generated by any context-free grammar [5] and thus cannot be
expressed strictly within the alternative frameworks. This improved expressivity comes at a price,
since the manual manipulation of indices is error-prone, as accurately pointed out by Giegerich et
al., so one may want to think of our proposal as more of a byte code, possibly produced from a
higher-level source code (ADP, split types...).

Fig. 1. Illustration of F-graphs, F-paths and the independence property. Straight lines indicate
classic arcs, and bent lines indicate hyperarcs.

Outline. In Section 2 we briefly recall some basic definitions related to forward directed
hypergraphs. In Section 3 we recall and propose dynamic programming algorithms for generic
problems on F-graphs. Then in Section 4 we illustrate our programme by proposing and proving
unambiguous decompositions for three spaces of conformations: classic secondary structures in the
Turner energy model [32], a (weighted) base-pair maximisation version of Akutsu's simple-type
pseudoknots [1], and fully-recursive kissing hairpins (an unambiguous restriction of Chen et al.
[8]). We also describe a simplified proof strategy based on generating functions to prove the
correctness of a given decomposition. Section 5 enriches the scope of applications of our framework
by proposing a general algorithm for extracting the moments of additive features (free-energy,
base-pairs, helices...) in a weighted distribution (generalizing a previous contribution by Miklós
et al. [36]). Finally, Section 6 concludes with some remarks and possible extensions and
improvements.

2 Notations and key notions 

Let us first recall that a directed hypergraph generalizes the notion of directed graph by allowing
any number of vertices as origin (tail) and destination (head) for each (hyper)arc. We will be
focusing here on Forward-Hypergraphs, or F-graphs, which restrict the tail of their arcs to a
single vertex.

Formally, let $V$ be a set of vertices. An F-arc $e = (t(e) \to h(e)) \in V \times \mathcal{P}(V)$
connects a single tail vertex $t(e) \in V$ to an ordered list of vertices $h(e) \subseteq V$. An
F-graph $\mathcal{H} = (V, E)$ is characterized by a set of vertices $V$ and a set of F-arcs $E$.
Denote by $c_n$ the children of a node $n$ in a tree; then an F-path of $\mathcal{H} = (V, E)$ is a
tree $T = (V' \subseteq V, E')$ such that, for any node $n \in V'$, $(n \to c_n) \in E$. For the
sake of simplicity, we may omit the implicit $V'$ and identify an F-path with its set of edges $E'$.

An F-derivation from a vertex $s \in V$ can be recursively defined as either $(s, \varnothing)$ if
$(s \to \varnothing) \in E$, or $(s, D_1 \ldots D_{|t|})$ if $(s \to t) \in E$,
$t = (t_1, t_2, \ldots, t_{|t|})$, and each $D_i$ is an F-derivation starting from $t_i$. An F-graph
is acyclic if and only if any vertex $s \in V$ is present only once (as a root) in any derivation
starting from $s$. Moreover it is independent if and only if any vertex $s \in V$ is reached at
most once in any derivation, regardless of its root.

A weighted F-graph is a triplet $(V, E, \pi)$ such that $(V, E)$ is an F-graph and
$\pi : E \to \mathbb{R}$ is a weight function that associates a weight to each F-arc. Finally, an
oriented F-graph is a quadruplet $(v_0, V, E, \pi)$ such that $(V, E, \pi)$ is a weighted
independent F-graph, and $v_0 \in V$ is a distinguished initial vertex.

Remark 1: Notice that our definition of F-arcs and F-paths implicitly defines terminal vertices,
since any leaf $l$ in an F-path has no child and our definition of F-paths therefore requires
$(l \to \varnothing)$ to be an F-arc of $\mathcal{H}$.

Remark 2: Under the independence property, the derivations starting from any node $s \in V$ are
trees, and are therefore in bijection with F-paths originating from the same vertex.
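
The acyclicity and independence properties can be checked mechanically on small instances. The Python sketch below is only an illustration under assumptions that are not part of the text: the F-graph is stored in a hypothetical dictionary `ARCS` mapping each vertex to the heads of its outgoing F-arcs, and the independence test is a sufficient condition, which is exact whenever every vertex admits at least one finite derivation.

```python
# Hypothetical toy F-graph (an assumption made for illustration only):
# each vertex maps to the heads of its F-arcs, and an empty head ()
# encodes a terminal F-arc (s -> emptyset).
ARCS = {
    "a": [("b", "c"), ("c",)],
    "b": [()],
    "c": [(), ("d",)],
    "d": [()],
}


def reachable(arcs, s):
    """Vertices reachable from s through at least one F-arc."""
    seen, stack = set(), [s]
    while stack:
        v = stack.pop()
        for head in arcs[v]:
            for child in head:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
    return seen


def is_acyclic(arcs):
    """No vertex may reappear below itself in a derivation."""
    return all(s not in reachable(arcs, s) for s in arcs)


def is_independent(arcs):
    """Acyclic, and siblings in a head never reach a common vertex."""
    if not is_acyclic(arcs):
        return False
    for heads in arcs.values():
        for head in heads:
            seen = set()
            for child in head:
                below = reachable(arcs, child) | {child}
                if seen & below:  # two siblings share a descendant
                    return False
                seen |= below
    return True
```

On this toy instance both properties hold; redirecting `b` towards `d` keeps the graph acyclic but breaks independence, since `d` could then be reached below both children of the arc `a -> (b, c)`.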

3 Generic problems and algorithms for F-paths in F-graphs 

In the following, terminal cases will very seldom appear explicitly, but will rather be captured by
the limit cases of products $\prod_{u \in \varnothing} f(u) = 1$ and sums
$\sum_{u \in \varnothing} f(u) = 0$, for any function $f$.

Generating and counting F-paths in oriented F-graphs [55]. Let $\mathcal{H} = (v_0, V, E, \pi)$ be
an oriented F-graph; we address the problem of generating the set $\mathcal{T}_{v_0}$ of F-paths
obtained starting from $v_0$.

From the tree-like definition of F-paths and our remark on terminal vertices, we know that any
F-path starting from a vertex $s$ can either be a leaf, provided that there exists an F-arc
$s \to \varnothing$, or an internal node. In the latter case, any F-path is composed of auxiliary
paths, generated from the vertices in the head of some F-arc having $s$ as tail. Remark that our
definition of F-paths requires each vertex from $V$ to appear at most once in any F-path, a fact
that is ensured here by the acyclicity of $\mathcal{H}$. Therefore we can recursively define the
set $\mathcal{T}_s$ of F-paths starting from a root node $s$ as

$$\mathcal{T}_s = \bigcup_{(s \to t) \in E} \{(s \to t)\} \times \prod_{s' \in t} \mathcal{T}_{s'}, \quad \forall s \in V. \qquad (1)$$

Since $E$ is a set, the candidate heads for a given tail $s$ are distinct and the unions in the
above equation are disjoint. Furthermore, the products are Cartesian, so we can directly transpose
the recurrence above over the cardinalities $n_s = |\mathcal{T}_s|$ and obtain

$$n_s = \sum_{(s \to t) \in E} \prod_{s' \in t} n_{s'}, \quad \forall s \in V. \qquad (2)$$

This immediately yields an $O(|V| + |E| + \sum_{e \in E} |h(e)|)/\Theta(|V|)$ time/memory dynamic
programming algorithm for counting F-paths.
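
As an illustration of the counting recurrence, the following Python sketch counts F-paths by memoized recursion and cross-checks the result against an explicit enumeration. The toy F-graph `ARCS` and the function names are assumptions introduced for this example only.

```python
from functools import lru_cache
from itertools import product

# Hypothetical toy F-graph: vertex -> heads of its F-arcs.
# An empty head () encodes a terminal F-arc (s -> emptyset).
ARCS = {
    "a": [("b", "c"), ("c",)],
    "b": [()],
    "c": [(), ("d",)],
    "d": [()],
}


@lru_cache(maxsize=None)
def count_paths(s):
    """Number of F-paths rooted at s: sum over F-arcs of products over heads."""
    total = 0
    for head in ARCS[s]:
        term = 1  # empty product, reached for terminal F-arcs
        for child in head:
            term *= count_paths(child)
        total += term
    return total


def enumerate_paths(s):
    """Explicit list of F-paths rooted at s, each a frozenset of (tail, head) arcs."""
    paths = []
    for head in ARCS[s]:
        # Cartesian product of the auxiliary paths of each vertex in the head
        for combo in product(*(enumerate_paths(child) for child in head)):
            paths.append(frozenset({(s, head)}.union(*combo)))
    return paths
```

Memoization ensures that each vertex is evaluated once, so the work is proportional to the total size of the F-arcs, whereas the explicit enumeration grows with the number of F-paths and is only practical for validation on small instances.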

Minimal score F-path. Let us consider an additive scoring scheme based on weights, and accordingly
define the score of an F-path $p$ to be $\alpha(p) = \sum_{e \in p} \pi(e)$. We address here the
problem of finding an F-path $p_0$ having minimal score, or more formally some
$p_0 \in \mathcal{T}_{v_0}$ such that $\alpha(p) \geq \alpha(p_0)$ for all
$p \in \mathcal{T}_{v_0}$, where $\mathcal{T}_{v_0}$ denotes the set of F-paths starting from $v_0$.
From the independence of siblings and the strict additivity of the score, we know that the path
minimization problem has optimal substructure, i.e. any optimal solution is composed of optimal
solutions for its subproblems. Consequently, the minimal score $m_s$ of a path starting from a root
node $s \in V$ is given by

$$m_s = \min_{(s \to t) \in E} \Big( \pi(s \to t) + \sum_{s' \in t} m_{s'} \Big), \quad \forall s \in V. \qquad (3)$$

A classic backtrack procedure can then be used to reconstruct the F-path instance starting from
$s \in V$ and having minimal score. Alternatively, the previous recurrence can be modified as follows

$$p_s = \operatorname*{argmin}_{\substack{p' = \bigcup_{s' \in t} p_{s'} \\ \text{s.t. } (s \to t) \in E}} \alpha\big(\{(s \to t)\} \cup p'\big), \quad \forall s \in V, \qquad (4)$$

giving an $O(|V| + |E| + \sum_{e \in E} |h(e)|)/\Theta(|V|)$ time/memory DP algorithm for the
minimal weighted F-path.
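
The minimization and its backtrack can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the weighted toy F-graph `ARCS`, its weights and the function names are hypothetical.

```python
from functools import lru_cache

# Hypothetical weighted toy F-graph: vertex -> list of (weight, head).
ARCS = {
    "a": [(2.0, ("b", "c")), (3.5, ("c",))],
    "b": [(1.0, ())],
    "c": [(1.0, ()), (-4.0, ("d",))],
    "d": [(1.0, ())],
}


@lru_cache(maxsize=None)
def min_score(s):
    """Minimal additive score of an F-path rooted at s."""
    return min(w + sum(min_score(c) for c in head) for w, head in ARCS[s])


def arc_score(arc):
    """Best score achievable when the root uses the given F-arc."""
    weight, head = arc
    return weight + sum(min_score(c) for c in head)


def backtrack(s):
    """Rebuild one minimal-score F-path as a list of (tail, head) arcs."""
    _, head = min(ARCS[s], key=arc_score)
    path = [(s, head)]
    for child in head:
        path.extend(backtrack(child))
    return path
```

Here the optimal F-path from `a` uses the arc `a -> (b, c)` and then the negatively weighted arc `c -> (d)`, for a total score of 0.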



Weighted count and weighted random generation [10]. Let us extend multiplicatively on paths our
weight function, defining the weight of any F-path $p$ to be $\pi(p) = \prod_{e \in p} \pi(e)$.
Then a small modification of Equation (2) gives a recurrence for computing the cumulated weight, or
weighted count, $W_s$ of the set $\mathcal{T}_s$ of F-paths starting from a given vertex $s$:

$$W_s = \sum_{p' \in \mathcal{T}_s} \pi(p') = \sum_{(s \to t) \in E} \pi(s \to t) \cdot \prod_{s' \in t} W_{s'}, \quad \forall s \in V. \qquad (5)$$

Provided that the weights are positive, this defines a weighted probability distribution over
F-paths, which assigns to each path $p \in \mathcal{T}_{v_0}$ a probability

$$P(p \mid \pi) = \frac{\pi(p)}{\sum_{p' \in \mathcal{T}_{v_0}} \pi(p')} = \frac{\pi(p)}{W_{v_0}}. \qquad (6)$$

From the precomputed values $W_s$, one can perform a weighted random generation to draw at random a
set of $k$ F-paths from $v_0$ according to the weighted distribution. Starting from any vertex $s$,
the algorithm chooses at each step an F-arc $e = (s \to h(e))$ with probability

$$p_{s,e} = \frac{\pi(e) \cdot \prod_{s' \in h(e)} W_{s'}}{W_s},$$

and proceeds to the recursive generation of auxiliary paths from each vertex in $h(e)$. A simple
induction argument shows that any F-path is then generated with respect to the probability
distribution of Equation (6). The weighted count recurrence is computed by an
$O(|V| + |E| + \sum_{e \in E} |h(e)|)/\Theta(|V|)$ time/memory algorithm, and each path $p$ is
generated in $O(|p| + \sum_{e \in p} |h(e)|)/\Theta(|p|)$ time/memory.
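
A possible rendering of the weighted count and of the stochastic backtrack is sketched below in Python. The toy F-graph, its weights and all function names are assumptions made for illustration; the arc is selected by a plain linear scan, not by the Boustrophedon strategy.

```python
import random
from functools import lru_cache

# Hypothetical weighted toy F-graph: vertex -> list of (weight, head),
# with strictly positive weights.
ARCS = {
    "a": [(2.0, ("b", "c")), (3.0, ("c",))],
    "b": [(1.0, ())],
    "c": [(1.0, ()), (4.0, ("d",))],
    "d": [(1.0, ())],
}


@lru_cache(maxsize=None)
def weighted_count(s):
    """Cumulated weight W_s of all F-paths rooted at s."""
    return sum(subtree_weight(arc) for arc in ARCS[s])


def subtree_weight(arc):
    """Weight of the arc times the weighted counts of the vertices in its head."""
    weight, head = arc
    for child in head:
        weight *= weighted_count(child)
    return weight


def sample_path(s, rng):
    """Draw one F-path rooted at s with probability proportional to its weight."""
    r = rng.random() * weighted_count(s)
    chosen = ARCS[s][-1]  # fallback guarding against floating-point rounding
    for arc in ARCS[s]:
        r -= subtree_weight(arc)
        if r < 0:
            chosen = arc
            break
    _, head = chosen
    path = [(s, head)]
    for child in head:
        path.extend(sample_path(child, rng))
    return path
```

With these weights the total weight from `a` is 25, and the arc `a -> (b, c)` is expected in 40% of the sampled F-paths, which can be checked empirically with `sample_path("a", random.Random(0))` called repeatedly on the same generator.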

Remark 3: This worst-case complexity can be improved using additional information on the structure
of the F-graph. For instance, when both the height and maximal degree of a vertex are bounded by
some constant $n$, Boustrophedon search [17,41] can be used to decrease the worst-case complexity
of each generation from $O(n^2)$ to $O(n \log n)$.



Arc traversal probabilities. Using the same probability distribution, a natural problem is to
compute the probability $p_e$ of an F-arc $e \in E$ being in a random F-path. To that purpose one
can use the classic inside/outside algorithm, which can be rephrased as an F-graph traversal.

Let us first point out that the probability $p_e$ is related to the cumulated weight of all F-paths
featuring an edge $e = (t(e) \to h(e))$ through

$$p_e = \frac{\sum_{p \in \mathcal{T}_{v_0},\ e \in p} \pi(p)}{\sum_{p \in \mathcal{T}_{v_0}} \pi(p)} = \frac{\sum_{p \in \mathcal{T}_{v_0},\ e \in p} \pi(p)}{W_{v_0}},$$

where $\mathcal{T}_{v_0}$ is the set of F-paths starting from $v_0$.

From the independence of $\mathcal{H}$ we know that each vertex appears at most once in any given
F-path, and consequently any F-path traversing $e$ can be unambiguously decomposed into: i) an
$e$-outside tree, i.e. a derivation from $v_0$ whose leaves are either terminal or $t(e)$, and
which features exactly one occurrence of $t(e)$; ii) a support edge $e = (t(e) \to h(e))$; iii) an
$e$-inside tree, i.e. a set of F-paths issued from $h(e)$.

The unambiguity of the decomposition, along with the independence of i) and iii), translates into

$$\sum_{p \in \mathcal{T}_{v_0},\ e \in p} \pi(p) = b_{t(e)} \cdot \pi(e) \cdot \prod_{s' \in h(e)} W_{s'} \qquad (8)$$

where $b_s$ is the cumulated weight of all outside trees leaving $s \in V$ underived. Finally it
can be shown that the cumulated weights $b_s$ over all outside trees obey the following simple
recurrence

$$b_s = \mathbb{1}_{s = v_0} + \sum_{\substack{e' \in E \\ \text{s.t. } s \in h(e')}} \pi(e') \cdot b_{t(e')} \cdot \prod_{\substack{s' \in h(e') \\ s' \neq s}} W_{s'}, \quad \forall s \in V \qquad (9)$$

which can be computed in $O(|V| + |E| + \sum_{e \in E} |h(e)|^2)/\Theta(|V|)$ time/memory. The
probability $p_e$ of traversing $e$ in a random F-path can finally be computed through the formula

$$p_e = \frac{b_{t(e)} \cdot \pi(e) \cdot \prod_{s' \in h(e)} W_{s'}}{W_{v_0}}, \quad \forall e \in E. \qquad (10)$$
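
The inside/outside computation can be sketched as follows in Python, under the same kind of assumptions as before: the weighted toy F-graph `ARCS`, the root `ROOT` and the function names are hypothetical, and the graph is assumed independent so that a vertex occurs at most once in any head.

```python
from functools import lru_cache

ROOT = "a"
# Hypothetical weighted toy F-graph: vertex -> list of (weight, head).
ARCS = {
    "a": [(2.0, ("b", "c")), (3.0, ("c",))],
    "b": [(1.0, ())],
    "c": [(1.0, ()), (4.0, ("d",))],
    "d": [(1.0, ())],
}

# Reverse index: for each vertex, the F-arcs whose head contains it.
INCOMING = {s: [] for s in ARCS}
for tail, arcs in ARCS.items():
    for weight, head in arcs:
        for child in head:
            INCOMING[child].append((tail, weight, head))


@lru_cache(maxsize=None)
def inside(s):
    """Weighted count W_s of the F-paths rooted at s."""
    total = 0.0
    for weight, head in ARCS[s]:
        term = weight
        for child in head:
            term *= inside(child)
        total += term
    return total


@lru_cache(maxsize=None)
def outside(s):
    """Cumulated weight b_s of the outside trees leaving s underived."""
    total = 1.0 if s == ROOT else 0.0
    for tail, weight, head in INCOMING[s]:
        term = weight * outside(tail)
        for sibling in head:
            if sibling != s:
                term *= inside(sibling)
        total += term
    return total


def arc_probability(tail, weight, head):
    """Probability that the F-arc (tail -> head) belongs to a random F-path."""
    term = outside(tail) * weight
    for child in head:
        term *= inside(child)
    return term / inside(ROOT)
```

On this instance the four F-paths from `a` have weights 2, 8, 3 and 12, so that the arc `c -> (d)` is traversed with probability (8 + 12)/25 = 0.8, in agreement with `arc_probability("c", 4.0, ("d",))`.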

4 F-graphs reformulation of (Pseudoknotted) RNA conformation spaces 

From the previous section, we know that very simple algorithms exist for weighted optimization and
enumeration problems over the F-paths of an F-graph. Let us now consider MFE folding-related
problems over an arbitrary conformation space $\mathcal{D}$ for a given sequence, under an energy
model $E : \mathcal{D} \to \mathbb{R}$, and assume that there exists: C1. An F-graph $\mathcal{H}$
whose F-paths $\mathcal{T}$ are in bijection with the conformation space $\mathcal{D}$; C2. A
weight function $\pi$ such that the (additive) score of any F-path coincides with the energy of its
corresponding conformation.

Under such conditions, it can be remarked that the minimal score algorithm (Equation (3)) exactly
computes the Minimal Free-Energy $\mathrm{MFE} = \min_{s \in \mathcal{D}} E_s$. Furthermore, the
weighted count (Equation (5)), applied to a weight function $\pi'(e) = e^{-\pi(e)/RT}$, computes the
partition function $Z = \sum_{s \in \mathcal{D}} e^{-E_s/RT}$. Other quantities of interest for RNA
folding can also be derived, as summarized in Tables 1 and 2.
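
The link between the additive (energy) and multiplicative (Boltzmann) views can be made concrete with a short Python sketch. The toy F-graph, its energy contributions and the value of `RT` (roughly 0.6163 kcal/mol near 37 °C) are assumptions chosen for illustration; the brute-force enumeration is only there to validate the two recurrences.

```python
import math
from functools import lru_cache
from itertools import product

RT = 0.6163  # kcal/mol, assumed value of R*T near 37 degrees Celsius

# Hypothetical toy F-graph: vertex -> list of (energy contribution, head).
ARCS = {
    "a": [(-1.0, ("b", "c")), (0.5, ("c",))],
    "b": [(0.0, ())],
    "c": [(0.0, ()), (-2.0, ("d",))],
    "d": [(0.3, ())],
}


@lru_cache(maxsize=None)
def min_energy(s):
    """Minimal additive energy of an F-path rooted at s (MFE-like quantity)."""
    return min(w + sum(min_energy(c) for c in head) for w, head in ARCS[s])


@lru_cache(maxsize=None)
def partition(s):
    """Weighted count under the Boltzmann weights exp(-energy / RT)."""
    total = 0.0
    for w, head in ARCS[s]:
        term = math.exp(-w / RT)
        for c in head:
            term *= partition(c)
        total += term
    return total


def path_energies(s):
    """Brute-force list of the energies of all F-paths rooted at s."""
    energies = []
    for w, head in ARCS[s]:
        for combo in product(*(path_energies(c) for c in head)):
            energies.append(w + sum(combo))
    return energies
```

Because the exponential turns sums of energies into products of weights, `partition("a")` coincides with the sum of `exp(-E / RT)` over the energies returned by `path_energies("a")`, while `min_energy("a")` equals their minimum.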

4.1 Foreword: Shortening correctness proofs through generating functions

Our main challenge is to find a hypergraph/weight pair such that the energy function can be
expressed in an additive fashion. Focusing first on Condition C1, one remarks that finding a
function $\psi : \mathcal{T} \to \mathcal{D}$ which maps F-paths to elements of the conformation
space is not challenging, as it essentially amounts to figuring out which derivation creates which
base-pairs. Condition C1 is then traditionally broken into two parts: an unambiguity condition,
which requires distinct elements in $\mathcal{T}$ to give rise to distinct elements within
$\mathcal{D}$, i.e. $\psi$ should be injective; and a completeness condition, which requires each
element in $\mathcal{D}$ to have at least one pre-image, i.e. $\psi$ should be surjective.

Since these notions are intimately related to the semantics associated with the F-paths, they
cannot be tackled in an automated way at the hypergraph level¹. Therefore correctness proofs will
usually require user-assigned semantics coupled with custom arguments, a task that may become
challenging and/or tedious for complex decompositions. In order to simplify the validation, and
therefore the design, of new conformation spaces, we propose a simplified proof technique based on
generating functions.

Indeed, instead of specializing the hypergraph for each and every input sequence, one can delegate
to the weight function the responsibility of weeding out conformations, e.g. by assigning them
$+\infty$ energetic contributions within MFE folding. Therefore each class of conformations can be
seen as a family of conformation spaces $\{\mathcal{D}_n\}_{n \geq 0}$ (secondary structures,
simple-type pseudoknots...), to which one associates a family of hypergraphs
$\{\mathcal{H}_n\}_{n \geq 0}$, a decomposition, both indexed by the length $n$ of the sequence.

Let us recall that generating functions are formal power series that can be used to store various
information. For instance the counting generating function for the conformation space family
$\mathcal{D}$ can be defined as $S_{\mathcal{D}}(z) = \sum_{n \geq 0} |\mathcal{D}_n| \cdot z^n$,
where $z$ is a formal complex variable devoid of intuitive meaning. Furthermore let $\mathcal{T}_n$
be the set of F-paths associated with $\mathcal{H}_n$; then the counting generating function of the
decomposition can be defined as $S_{\mathcal{H}}(z) = \sum_{n \geq 0} |\mathcal{T}_n| \cdot z^n$.
Then the formal identity $S_{\mathcal{D}}(z) = S_{\mathcal{H}}(z)$ implies that
$|\mathcal{D}_n| = |\mathcal{T}_n|, \forall n \geq 0$. It follows from basic set theory that
unambiguity/injectivity (resp. completeness/surjectivity) of $\psi$, in addition to the identity of
generating functions, is in itself sufficient to prove the bijectivity of $\psi$, since an
injective (resp. surjective) map between finite sets of equal cardinality is bijective. Since
reference generating functions are now available for many conformation space families [47], this
practically halves the burden of designing a proof.

¹ Algebraic Dynamic Programming partially addresses this issue, and the interested reader is
referred to an early contribution by Reeder et al. [43].
Fig. 2. Simplification of the Unafold [32] decomposition of the secondary structure space. Framed
states indicate origins of (hyper)arcs.



4.2 RNA secondary structures 

Let us first illustrate our approach on RNA secondary structures, for which Unafold [32] - the
successor of MFold [56] - offers an unambiguous scheme. Compared to the original decomposition
presented in Markham's thesis, the one described in Figure 2 is simplified to ignore dangles.



Proving unambiguity.

- Let us remark that both $Q^\star$ and $Q^1$ either leave their last base $j$ unpaired (Left), or
pair it to some base $i$ (Right). Furthermore these two cases are mutually exclusive. Finally $Q^1$
generates exactly one helix.

- $Q$ always makes at least one call to $Q^1$ and therefore creates at least one helix. Therefore,
it either creates exactly one helix (Left case) or more (Right case), and these two cases are
mutually exclusive.

- $Q'$ distinguishes different types of loops. Let $m_5$ and $m_3$ be the numbers of unpaired bases
on the 5' strand and 3' strand, and $h$ be the number of helices starting from case $Q'$; one can
label each of the cases and observe that they are mutually non-overlapping. Namely, from left to
right, we get the following $(m_5, m_3, h)$ triplets: interior loop $(>0, >0, 1)$, stacking pair
$(0, 0, 1)$, multiloop $(>0, >0, >1)$, bulges 5' $(>0, 0, 1)$ and 3' $(0, >0, 1)$, and hairpin loop
$(>0, >0, 0)$.



Deriving completeness. From previous work by Waterman [54], we know that the generating function
of secondary structures with at least one unpaired base between paired bases ($\theta = 1$) is

$$S(z) = \frac{1 - z + z^2 - \sqrt{1 - 2z - z^2 - 2z^3 + z^4}}{2z^2}. \qquad (11)$$

Following the general principle of the so-called DSV methodology (see Lorenz et al. [29] for a
presentation in a similar context), the Unafold decomposition can be translated into a system of
algebraic equations. Namely, one simply replaces any occurrence of $k$ unpaired bases with $z^k$,
each base-pair with $z^2$, and any vertex with its associated generating function. Let
$Q^\star(z)$, $Q(z)$, $Q'(z)$ and $Q^1(z)$ be the generating functions counting the F-paths
generated from $Q^\star$, $Q$, $Q'$ and $Q^1$ respectively:

$$\begin{aligned}
Q^\star(z) &= Q^\star(z) \cdot z + Q^\star(z) \cdot Q'(z) + 1 \\
Q(z) &= \mathrm{Seq}(z) \cdot Q^1(z) + Q(z) \cdot Q^1(z) \\
Q^1(z) &= z \cdot Q^1(z) + Q'(z) \\
Q'(z) &= z^2 \cdot \mathrm{Seq}^+(z) \cdot Q'(z) \cdot \mathrm{Seq}^+(z) + z^2 \cdot Q'(z) + z^2 \cdot Q(z) \cdot Q^1(z) \\
&\quad + z^2 \cdot Q'(z) \cdot \mathrm{Seq}^+(z) + z^2 \cdot \mathrm{Seq}^+(z) \cdot Q'(z) + z^2 \cdot \mathrm{Seq}^+(z) \\
\mathrm{Seq}^+(z) &= z \cdot \mathrm{Seq}(z) \\
\mathrm{Seq}(z) &= z \cdot \mathrm{Seq}(z) + 1
\end{aligned}$$

where the constant term in $Q^\star(z)$ accounts for the empty structure, a terminal case. Solving
the system yields $Q^\star(z) = S(z)$ which, in conjunction with the unambiguity of the
decomposition, proves its completeness.
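
The identity between the two generating functions can also be checked numerically on the first coefficients. The Python sketch below is a sanity check written under our own assumptions, not part of the original proof: it iterates the algebraic system on truncated power series, including the constant term for the empty structure and a $z^2$ factor for the base-pair closing the hairpin, and compares the result with the coefficients of $S(z)$ obtained from the functional equation $S(z) = 1 + z S(z) + z^2 S(z) (S(z) - 1)$, which follows from the closed form (11). The truncation order `N` and all function names are arbitrary choices.

```python
N = 12  # truncation order of the power series (arbitrary choice)


def add(*series):
    """Coefficient-wise sum of truncated series."""
    return [sum(coeffs) for coeffs in zip(*series)]


def mul(a, b):
    """Product of two series, truncated at order N."""
    out = [0] * (N + 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b[: N + 1 - i]):
            out[i + j] += x * y
    return out


def shift(a, k):
    """Multiplication by z^k, truncated at order N."""
    return ([0] * k + a)[: N + 1]


def solve_system():
    """Iterate the system from zero until a fixed point; return Q*(z)."""
    one = [1] + [0] * N
    seq = [1] * (N + 1)   # Seq(z) = 1 / (1 - z)
    seqp = shift(seq, 1)  # Seq+(z) = z * Seq(z)
    qs = q = q1 = qp = [0] * (N + 1)
    while True:
        loops = add(mul(mul(seqp, qp), seqp), qp, mul(q, q1),
                    mul(qp, seqp), mul(seqp, qp), seqp)
        new = (add(one, shift(qs, 1), mul(qs, qp)),  # Q*
               add(mul(seq, q1), mul(q, q1)),        # Q
               add(shift(q1, 1), qp),                # Q^1
               shift(loops, 2))                      # Q'
        if new == (qs, q, q1, qp):
            return qs
        qs, q, q1, qp = new


def waterman_counts():
    """Coefficients of S(z), from S = 1 + z*S + z^2 * S * (S - 1)."""
    s = [0] * (N + 1)
    s[0] = 1
    for n in range(1, N + 1):
        s[n] = s[n - 1] + sum(s[k] * s[n - 2 - k] for k in range(1, n - 1))
    return s
```

Both computations return the sequence 1, 1, 1, 2, 4, 8, 17, 37, 82, 185, ... of secondary structure counts. Since all coefficients of the system are non-negative, the iterates are non-decreasing and bounded, which guarantees that the loop terminates.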



Application                               Algorithm              Weight fun.     Time                     Memory        Ref.
A - Energy minimization                   Minimal weight
B - Partition function                    Weighted count                         $O(n^3)$                 $O(n^2)$      [35]
C - Base-pairing probabilities            Arc-traversal prob.                    $O(n^3)$                 $O(n^2)$      [35]
D - Statistical sampling ($k$ samples)    Weighted random gen.                   $O(n^3 + k\,n \log n)$   $O(n^2)$      [12,41]
E - Moments of energy (Mean, Var.)        Moments extraction                     $O(n^3)$                 $O(n^2)$      [36]
F - $m$-th moment of additive features    Moments extraction     $e^{-\pi/RT}$   $O(m^{?}\,n^3)$          $O(m\,n^2)$
G - Correlations of additive features     Moments extraction                     $O(n^3)$                 $O(n^2)$

Table 1. Reformulations of secondary structure applications as F-graph problems and associated
complexities.



Applicability of generic algorithms. Let us show that $\mathcal{H}$ fulfills the prerequisites of
our algorithms. First it is easily verified that $\mathcal{H}$ is an F-graph. Associating a region
$[i, j]$ (resp. $[1, j]$) with each vertex $q^1_{i,j}$, $q_{i,j}$ and $q'_{i,j}$ (resp.
$q^\star_j$), one easily verifies that for any F-arc $e \in E$ the width of any region in the head
$h(e)$ is strictly smaller than that of the tail $t(e)$, and the acyclicity of $\mathcal{H}$
directly follows. Furthermore, any two vertices in the head $h(e)$ have non-overlapping associated
regions. Consequently $\mathcal{H}$ is independent, and a direct application of our generic
algorithms gives a set of algorithms summarized in Table 1. This gives a family of efficient
$O(n^3)$ algorithms for assessing RNA secondary structure properties at the Boltzmann equilibrium.




Fig. 3. Alternative exhaustive strategies for interior loops. 



Remark 4: In interior loops, the set of F-arcs generated for the $Q'$ case has apparent cardinality
in $O(n^4)$. This can be brought back to $O(n^3)$ by enforcing constraints on the energy function.
Traditionally, the accepted practice is to bound the interior loop size $(i' - i) + (j - j')$ from
above by a predefined constant $K \approx 30$. Exhaustive $O(n^3)$ decompositions can also be
proposed (Figure 3) by decomposing the internal loop into additively-contributing regions. A first
option may generate independently the left and right unpaired regions (Figure 3, Left), while an
alternative may decompose internal loops into a symmetric loop followed by a fully asymmetric one
(Figure 3, Right).

4.3 Simple-type pseudoknots 

In his seminal work, Akutsu [1] focused on a subset of pseudoknot motifs, the simple-type
pseudoknots, and proposed algorithms of complexity in $O(n^4)$ for simple non-recursive pseudoknots
in a basepair-maximisation energy model, and in $O(n^5)$ for recursive pseudoknots and loop-based
energy models. However, the decomposition proposed in [1] is ambiguous, e.g. there exist different
ways to create unpaired regions. Therefore we propose in Figure 4 an unambiguous decomposition for
the same conformation space.

Previous results. In a previous work [47,48], one of the authors showed that simple-type
pseudoknots can be encoded by a simple formal language, in bijection with a context-free language.
Here we focus on partly recursive simple pseudoknots presented in Figure 4. They can be encoded by
a well-parenthesized word $p$ over two systems of parentheses $\{(f, \bar{f}), (g, \bar{g})\}$,
respectively indicating the leftmost and rightmost basepairs in Figure 4, and an unpaired character
$c$ such that




Fig. 4. An unambiguous decomposition for simple non-recursive pseudoknots that captures the Akutsu/Uemura 
class of pseudoknots. This decomposition yields Oin^)/©{n^) time/memory algorithms for partially recursive pseu- 
doknots and can be extended to include recursive pseudoknots and/ or Turner energy contributions in 0{n^) I ©{n"^). 

where $k$ is some integral value, $\sum_{i=1}^{k} n_i = n \geq 1$, $\sum_{i=1}^{k} m_i = m \geq 1$,
and $p', p''$ are any two recursively-generated conformations.

Completeness. Let us show that the decomposition in Figure 4 is complete, i.e. that any partially
recursive pseudoknot can be generated by the decomposition.

Let us initially focus on base-pairs and ignore unpaired bases. The smallest word within the language 
of Equation [12] is fp'gfp"g which can be generated by applying the initial case (Q Al ~^ A.m ~^ A 
p' ...g...g) followed directly by the terminal case [A^ At ^ f p' g f p" g). Moreover through a sequence 
A^ Ar^ Am Ay one adds an outermost edge around the right part g...g. So through m iterations 
of the sequence the decomposition generates any structure g^^ . . . g^^ . Similarly through a sequence 
A^ Al^ Am A one adds an outermost edge around the left part f ...f, and after n\ iterations any 
structure /^^ . . . /^^ is generated. Since these two sequences can be combined and alternated (starting 
with the initial case and finishing with the terminal case), then the decomposition generates any word 

p-rp' g"' n g"^ n- n p" g'"g- ds) 

For the recursive call p^ it is easily verified that Q* generates any (PK) structure. For p" it is worth men- 
tioning that, at a base-pairing level, A^ At (right base paired) and A^ cover all possible situations. 

Arbitrary numbers of unpaired bases c can also be inserted right before the opening / of a leftward 
base pair (resp. after closure / of a leftward base pair, after the opening g of a right base pair and before 
the closure g of a right base pair) by repeatedly applying the Al At (resp. Am Am, At At and 
Am Am) rule after adding a left (resp. right) base pair. Consequently any structure described by a word 
in Equation[T2]can be generated by the decomposition. 

Unambiguity. Let us now address the unambiguity of the decomposition, using our approach based on 
generating functions. Equation[T2limmediately gives a system of equations relating AU (z), the generating 
function of simple partially recursive pseudoknots, to S{z) the gen. fun. of all structures: 

^ ( Z \n ( Z \mi ( z \ni ( z \^A:-1 ( Z Z^5(z)^(l-z) 

^^(-)=Z(— ) (— ) •••(—) -^(-^-(— ) - ■ 

Now consider the dynamic programming decomposition illustrated by Figure 4. Associating generating
functions to each type of vertex and translating assigned bases into monomials, we obtain the
following system of equations:



Q\z) = z^ S{z) Ar{z) At{z) = zAt{z) + Am{z) Ar{z) = zAr{z) + Am{z) 

Am{z) = zAm{z) + A{z) A{z) = z^ Ar{z) + z^ At{z) + z^ S{z) At{z) = S{z) (1 - z) - I. 




Fig. 5. Unambiguous decomposition of fully recursive kissing hairpins. 



The last expression for Ariz) follows directly from the observation that any structure in Q can be written 
as a sequence of structures from interleaved with sequences of unpaired bases. Given that At cannot 
feature unpaired bases on its right end, one of the sequence of unpaired base must be removed. Further- 
more At does not generate the empty structure, so we have = (v4r(^) + !)/(!- z). Solving the system 
gives Q\z) = ^ [^l2z+^^^ ~ ^^^^^ unambiguity/ correctness of the decomposition directly follow. 



4.4 Fully-recursive kissing hairpins

Kissing hairpins (KH) are pseudoknotted structures composed of two helices whose terminal loops are
linked by a third helix. These pseudoknots are frequently observed, and are exhaustively predicted by
Chen et al. [8] in $O(n^5)$ time, and in $O(n^4)$ / $O(n^5)$ under restrictions by Theis et al. [50].
Figure 5 presents an unambiguous decomposition which generates the space of recursive kissing hairpins.
Previous results. Again, an encoding of kissing hairpins can be found in earlier work by one of the
authors [47], showing that any KH pseudoknot can be represented by a word $p$ over three systems of
parentheses $\{(f, \bar{f}), (g, \bar{g}), (h, \bar{h})\}$ (respectively denoting leftmost, central and
rightmost helices) such that:

$$ p = (f S)^{n} \, (g S)^{m} \, (\bar{f} S)^{n} \, (h S)^{k} \, (\bar{g} S)^{m} \, (\bar{h} S)^{k-1} \, \bar{h}. \tag{14} $$

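Completeness. First let us remark that the minimal conformation generated by the decomposition is
$K_L \to K_R \to K'_R \to K_M \Rightarrow f S g S \bar{f} S h S \bar{g} S \bar{h}$. Remark that one can
iterate arbitrarily over the states $K_L \to K'_L \to K_L$, $K'_R \to K_R \to K'_R$ and
$K_M \to K'_M \to K_M$. Consequently one may insert patterns
$(K_L \to K'_L \to K_L)^{n-1} \Rightarrow (S f)^{n-1} \cdots (\bar{f} S)^{n-1}$,
$(K'_R \to K_R \to K'_R)^{k-1} \Rightarrow (h S)^{k-1} \cdots (\bar{h} S)^{k-1}$ and
$(K_M \to K'_M \to K_M)^{m-1} \Rightarrow (g S)^{m-1} \cdots (S \bar{g})^{m-1}$
in the minimal word above, and produce any conformation denoted by

$$ f (S f)^{n-1} S \, (g S)^{m-1} g S \, (\bar{f} S)^{n-1} \bar{f} S \, h S (h S)^{k-1} \, \bar{g} (S \bar{g})^{m-1} S \, (\bar{h} S)^{k-1} \bar{h} $$

where one recognizes the language of Equation (14) upon simple expansion.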

Unambiguity. Equation (14) allows one to derive the generating function $KH(z)$ of kissing hairpins as a
function of $S(z)$, the generating function of all structures:

$$ KH(z) = \sum_{n,m,k \ge 1} (zS(z))^{n} (zS(z))^{m} (zS(z))^{n} (zS(z))^{k} (zS(z))^{m} (zS(z))^{k-1} z = \frac{z^{6} S(z)^{5}}{\left(1 - z^{2} S(z)^{2}\right)^{3}} $$
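The closed form of the triple sum can be sanity-checked numerically. The sketch below is only an illustration: it treats $z$ and $S$ as plain rational numbers (arbitrarily chosen so that $|zS| < 1$), truncates the sum over $n, m, k$, and compares it with $z^6 S^5 / (1 - z^2 S^2)^3$. The function names `kh_truncated` and `kh_closed` are hypothetical.

```python
from fractions import Fraction

def kh_truncated(z, s, bound):
    """Partial sum over 1 <= n, m, k <= bound of the word weights of Eq. (14)."""
    x = z * s  # one base followed by one substructure
    total = 0
    for n in range(1, bound + 1):
        for m in range(1, bound + 1):
            for k in range(1, bound + 1):
                # exponents n, m, n, k, m, k-1 of (zS), then one final base z
                total += x ** (2 * n + 2 * m + k + (k - 1)) * z
    return total

def kh_closed(z, s):
    """Candidate closed form z^6 S^5 / (1 - z^2 S^2)^3."""
    return z ** 6 * s ** 5 / (1 - z ** 2 * s ** 2) ** 3

# Arbitrary evaluation point with |zS| < 1, so the series converges.
z, s = Fraction(1, 10), Fraction(2)
gap = abs(kh_truncated(z, s, 12) - kh_closed(z, s))
print(float(gap))  # tiny truncation error
```

The remaining gap is the tail of the geometric series; it shrinks as `bound` grows.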

Now consider the dynamic programming decomposition illustrated by Figure 5 and translate it into a
system of functional equations:

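$$
\begin{aligned}
K(z) &= z^{4} K_L(z) S(z) \\
K_L(z) &= S(z) K'_L(z) + K_R(z) & K'_L(z) &= z^{2} K_L(z) S(z) & K_M(z) &= K'_M(z) S(z) + S(z)^{2} \\
K'_M(z) &= z^{2} K_M(z) S(z) & K_R(z) &= K'_R(z) S(z) & K'_R(z) &= z^{2} K_R(z) S(z) + z^{2} K_M(z) S(z)
\end{aligned}
$$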



    Conformation spaces: Simple type pseudoknots (Akutsu & Uemura); Fully recursive kissing hairpins

    Application                               Algorithm
    ----------------------------------------  --------------------
    A - Energy minimization                   Minimal weight
    B - Partition function                    Weighted count
    C - Base-pairing probabilities            Arc-traversal prob.
    D - Statistical sampling (k samples)      Weighted rand. gen.
    E - Moments of energy (Mean, Var.)        Moments extraction
    F - m-th moment of additive features      Moments extraction

Table 2. Summary of ensemble-based algorithms on simple pseudoknots and kissing hairpins. Weight functions
are based either on the simple Nussinov-Jacobson energy model ($\pi_{bp}$) or on a Turner-like model based
on loop contributions.



Solving the system gives $K(z) = \frac{z^{6} S(z)^{5}}{(1 - z^{2} S(z)^{2})^{3}} = KH(z)$ and the unambiguity
of the decomposition immediately follows. Again, hypergraph algorithms can be used, and specialize into the
complexities summarized in Table 2.
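The generating-function argument can be replayed mechanically on truncated power series. The sketch below rests on two assumptions that are not taken from the text: the system is read as $K = z^4 K_L S$, $K_L = S K'_L + K_R$, $K'_L = z^2 K_L S$, $K_M = K'_M S + S^2$, $K'_M = z^2 K_M S$, $K_R = K'_R S$, $K'_R = z^2 K_R S + z^2 K_M S$ (the exponent 4 in the first equation is a reconstruction), and $S(z) = 1/(1-z)$ is used as a stand-in for the generating function of all structures. It solves the system by fixed-point iteration and compares the result, coefficient by coefficient, with the series obtained directly from Equation (14).

```python
from math import comb

N = 16  # truncation order: series are kept modulo z^N

def mul(a, b):
    """Product of two truncated series (lists of N integer coefficients)."""
    out = [0] * N
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b):
                if i + j >= N:
                    break
                out[i + j] += ai * bj
    return out

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def shift(a, d):
    """Multiply a series by z^d."""
    return ([0] * d + a)[:N]

def power(a, e):
    out = [1] + [0] * (N - 1)
    for _ in range(e):
        out = mul(out, a)
    return out

S = [1] * N  # assumption: S(z) = 1/(1-z), illustrative stand-in only

# Fixed-point iteration: every cycle of the system carries a factor z^2,
# so each round fixes at least two more coefficients.
KL = KLp = KM = KMp = KR = KRp = [0] * N
for _ in range(N):
    KLp = shift(mul(KL, S), 2)
    KMp = shift(mul(KM, S), 2)
    KRp = add(shift(mul(KR, S), 2), shift(mul(KM, S), 2))
    KM = add(mul(KMp, S), mul(S, S))
    KR = mul(KRp, S)
    KL = add(mul(S, KLp), KR)
K = shift(mul(KL, S), 4)

# Direct expansion of Eq. (14): grouping words by t = n + m + k gives
# comb(t-1, 2) triples, each contributing z^(2t) * S^(2t-1).
KH = [0] * N
for t in range(3, N // 2 + 1):
    term = shift(power(S, 2 * t - 1), 2 * t)
    KH = add(KH, [comb(t - 1, 2) * c for c in term])

print(K == KH)
```

Agreement of the two coefficient lists up to the truncation order is what the unambiguity argument predicts under these assumptions; a mismatch would indicate that some structure is generated more than once or not at all.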

5 Extending the framework: Extraction of moments and exact correlations 

A last application addresses the extraction of statistical measures for additive features. Let us first define
a feature as a function $\alpha : E \to \mathbb{R}^{+}$, extended additively over F-paths such that
$\alpha(p) = \sum_{e \in p} \alpha(e)$. One may then want to characterize the distribution of a random variable
$X = \alpha(p)$, for $p \in \mathcal{P}_s$ a random F-path drawn according to the weighted distribution. As it
is not necessarily feasible to determine the exact distribution of $X$, one can examine statistical measures
such as its

Mean $\mu_X = \mathbb{E}[X]$ and Variance $\mathrm{Var}_X = \mathbb{E}[X^2] - \mu_X^2$

from which the distribution is fully determined, e.g., in the case of Gaussian distributions. Even when the
distribution is not normal, it can still be characterized by a list of measures called moments of $X$, the
$m$-th moment being defined as $\mathbb{E}[X^m] = \sum_{p \in \mathcal{P}_s} \alpha(p)^m \cdot \pi(p) / w_s$.
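As a concrete illustration of these definitions, the sketch below computes moments by brute force on an explicitly listed ensemble. The ensemble `toy` and the function name `moment` are hypothetical; each entry pairs a feature value with the weight of the corresponding F-path, and the total weight plays the role of the partition function.

```python
def moment(paths, m):
    """m-th moment of a feature over a weighted ensemble.

    `paths` is a list of (feature_value, weight) pairs, one per F-path.
    """
    w_s = sum(w for _, w in paths)  # total weight of the ensemble
    return sum(a ** m * w for a, w in paths) / w_s

# Hypothetical ensemble of three F-paths.
toy = [(0, 1.0), (1, 2.0), (2, 1.0)]
mean = moment(toy, 1)
variance = moment(toy, 2) - mean ** 2
print(mean, variance)
```

Such an enumeration is exponential in general, which is why a dynamic programming formulation over the hypergraph is of interest.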

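Moreover in the presence of multiple features $(X_1 := \alpha_1(p), \ldots, X_k := \alpha_k(p))$, similar
measures can be used to estimate their level of dependency. One such measure is the Pearson product-moment
correlation coefficient $\rho_{X_1,X_2}$, defined for two random variables as

$$ \rho_{X_1,X_2} = \frac{\mathrm{Cov}_{X_1,X_2}}{\sqrt{\mathrm{Var}_{X_1} \cdot \mathrm{Var}_{X_2}}} = \frac{\mathbb{E}[X_1 \cdot X_2] - \mathbb{E}[X_1] \cdot \mathbb{E}[X_2]}{\sqrt{\mathrm{Var}_{X_1} \cdot \mathrm{Var}_{X_2}}} $$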
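The correlation coefficient only requires five expectations, each of which is a (generalized) moment. A minimal sketch on explicitly enumerated ensembles follows; the data and the names `expect` and `pearson` are hypothetical and serve only to illustrate the formula.

```python
from math import sqrt

def expect(ensemble, f):
    """Weighted expectation of f(x1, x2) over (x1, x2, weight) triples."""
    w_s = sum(w for _, _, w in ensemble)
    return sum(f(x1, x2) * w for x1, x2, w in ensemble) / w_s

def pearson(ensemble):
    """Pearson correlation of two features in a weighted ensemble."""
    e1 = expect(ensemble, lambda a, b: a)
    e2 = expect(ensemble, lambda a, b: b)
    e12 = expect(ensemble, lambda a, b: a * b)
    var1 = expect(ensemble, lambda a, b: a * a) - e1 ** 2
    var2 = expect(ensemble, lambda a, b: b * b) - e2 ** 2
    return (e12 - e1 * e2) / sqrt(var1 * var2)

# Two hypothetical ensembles: X2 = X1, and X2 = 2 - X1.
same = [(0, 0, 1.0), (1, 1, 1.0), (2, 2, 2.0)]
opposite = [(0, 2, 1.0), (1, 1, 1.0), (2, 0, 2.0)]
print(pearson(same), pearson(opposite))
```

The two ensembles are chosen so that the features are perfectly correlated and perfectly anti-correlated, respectively.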

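The correlation above involves the expectation of a product of two random variables, which is an
instance of a generalized moment, defined for the set $\mathcal{P}_s$ of F-paths starting from $s \in V$ as

$$ \mathbb{E}[X_1^{m_1} \cdots X_k^{m_k} \mid s] = \sum_{p \in \mathcal{P}_s} \prod_{i=1}^{k} \alpha_i(p)^{m_i} \cdot \frac{\pi(p)}{w_s}. $$
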
Extracting such moments can be quite useful, allowing one to get access to average properties of
structures (#Hairpins, #Occurrences of pseudoknots, ...) and their correlations within a weighted ensemble.
For instance, Miklós et al. [36] proposed a $\Theta(m^2 \cdot n^3)$ algorithm for computing the $m$-th moment of
the energy distribution for secondary structures in order to compare the distribution of free energy in
non-coding RNAs and random sequences. We are going to show how these generalized moments can be
extracted directly through a generalization of the weighted count algorithm.

Theorem 1. Let $\alpha := (\alpha_1, \cdots, \alpha_k)$ be a vector of additive features and
$m := (m_1, \cdots, m_k)$ be a $k$-tuple of natural integers. Then the pseudo-moment
$\mathbb{E}[X_1^{m_1} \cdots X_k^{m_k} \mid s] \cdot w_s$ of $\alpha$ in a weighted distribution can be
recursively computed in $\Theta\big((|E| + |V|) \cdot k \cdot t^{+} \cdot \prod_{i=1}^{k} m_i^{t^{+}+1}\big)$
time complexity and $\Theta\big(|V| \cdot \prod_{i=1}^{k} m_i\big)$ memory, where
$t^{+} = \max_{(s \to t) \in E} |t|$ is the maximal out-degree of an arc.
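To make the idea of a recursive computation concrete, here is a minimal sketch for a single feature ($k = 1$). It is not claimed to be the recurrence of Theorem 1; it is one recurrence that follows from additivity. If a path decomposes into independent parts $A$ and $B$, then $\sum \pi_A \pi_B (\alpha_A + \alpha_B)^j = \sum_{i} \binom{j}{i} \big(\sum \pi_A \alpha_A^{i}\big)\big(\sum \pi_B \alpha_B^{j-i}\big)$, so vectors of pseudo-moments combine by binomial convolution. The toy F-graph `GRAPH`, its weights and all function names are hypothetical.

```python
from functools import lru_cache
from itertools import product
from math import comb

# Hypothetical acyclic F-graph: vertex -> list of arcs (weight, feature, children).
GRAPH = {
    "s": [(2.0, 1, ("a", "b")), (1.0, 0, ("b",))],
    "a": [(1.0, 1, ()), (3.0, 0, ("b",))],
    "b": [(1.0, 0, ()), (0.5, 2, ())],
}
M = 3  # highest moment order computed

def convolve(u, v):
    """Binomial convolution of two pseudo-moment vectors of length M + 1."""
    return [sum(comb(j, i) * u[i] * v[j - i] for i in range(j + 1))
            for j in range(M + 1)]

@lru_cache(maxsize=None)
def pseudo_moments(s):
    """Vector (sum over F-paths p from s of pi(p) * alpha(p)^j) for j = 0..M."""
    total = [0.0] * (M + 1)
    for weight, feat, children in GRAPH[s]:
        acc = [weight * feat ** j for j in range(M + 1)]  # the arc alone
        for t in children:
            acc = convolve(acc, pseudo_moments(t))
        total = [x + y for x, y in zip(total, acc)]
    return tuple(total)

def enumerate_paths(s):
    """Brute-force list of (weight, feature) for every F-path from s."""
    out = []
    for weight, feat, children in GRAPH[s]:
        for combo in product(*(enumerate_paths(t) for t in children)):
            w, a = weight, feat
            for cw, ca in combo:
                w *= cw
                a += ca
            out.append((w, a))
    return out

dp = pseudo_moments("s")
brute = [sum(w * a ** j for w, a in enumerate_paths("s")) for j in range(M + 1)]
mean = dp[1] / dp[0]  # dividing by the order-0 entry (total weight) gives E[X]
print(dp, brute, mean)
```

The order-0 entry of the vector is the weighted count itself, which is consistent with viewing moment extraction as a generalization of the weighted count algorithm.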

Adding this new generic algorithm automatically creates new applications for each and every conformation
space, as summarized in Figure 1. This simultaneous extension, for all conformational spaces, of possible
ensemble applications constitutes in our opinion one of the main benefits of detaching the decomposition
from its exploration.

6 Conclusion and Perspectives 

In this paper, we established the foundation of a combinatorial approach to the design of algorithms
for complex conformation spaces. We built on a hypergraph model introduced in the context of RNA
secondary structure by Finkelstein and Roytberg [16], which we extended in several directions. First we
formulated classic and novel generic algorithms on Forward-Hypergraphs for weighted ensembles,
allowing one to derive base-pairing probabilities, perform statistical sampling and extract moments of the
distribution of additive features. Then we showed how combinatorial arguments based on generating
functions could be used to simplify the proof of correctness for designed decompositions. We illustrated
the full programme on classic secondary structures, simple type pseudoknots and fully-recursive kissing
hairpin pseudoknots, for which we provided decompositions that were proven to be unambiguous
and complete with respect to previous work. The hypergraph formulation of the decomposition, coupled
with the generic algorithms, readily gave a family of novel algorithms for complex, yet relevant,
conformation spaces.

Let us mention some perspectives to our contribution. Firstly, the principles and algorithms described
here could easily be implemented as a general compiler tool for F-graph algorithms. Such a compiler
could be coupled with helper tools expanding hypergraphs from succinct descriptions, such as
context-free grammars (related to ADP [19]), or M. Möhl's split types [37]. More complex search spaces could
also be modeled, such as those relying on a more detailed representation of RNA structure (e.g. MCFold's
NCMs [40]), those capturing RNA-RNA interactions [2, 24], those offering simultaneous alignment and
folding (Sankoff's algorithm [46]) or performing mutations on the sequence [53]. Finally, our hypergraph
framework is not necessarily limited to polynomial algorithms, and algorithmic developments could be
proposed to address some of the current algorithmic issues in RNA (inverse folding [3], kinetics [49])
for which no exact polynomial algorithms are currently known (or suspected). More generally, it is our
hope that, by simplifying and modularizing the process of developing new, algorithmically tractable
conformation spaces, our contribution will help design better, more topologically-realistic [52, 28, 44]
energy and conformational spaces to better understand and predict the structure(s) of RNA.
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